Mandarin Singing Voice Synthesis Based on Harmonic Plus Noise Model and Singing Expression Analysis
The purpose of this study is to investigate how humans interpret musical
scores expressively, and then design machines that sing like humans. We
consider six factors that have a strong influence on the expression of human
singing. The factors are related to the acoustic, phonetic, and musical
features of a real singing signal. Given real singing voices recorded following
the MIDI scores and lyrics, our analysis module can extract the expression
parameters from the real singing signals semi-automatically. The expression
parameters are used to control the singing voice synthesis (SVS) system for
Mandarin Chinese, which is based on the harmonic plus noise model (HNM). The
results of perceptual experiments show that integrating the expression factors
into the SVS system yields a notable improvement in perceptual naturalness,
clarity, and expressiveness. By one-to-one mapping of the real singing signal
and expression controls to the synthesizer, our SVS system can simulate the
interpretation of a real singer with the timbre of a speaker. Comment: 8 pages, technical report
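As background for the HNM named above, the following is a minimal synthesis sketch: the voiced part is a sum of sinusoids at integer multiples of the fundamental, and the stochastic part is noise under a gain envelope. The function name, parameter shapes, and the gain-only noise model are illustrative assumptions, not the authors' system.

```python
import numpy as np

def hnm_synthesize(f0, harmonic_amps, noise_env, sr=22050, hop=256):
    """Toy harmonic-plus-noise synthesis from frame-level parameters.

    f0            : per-frame fundamental frequency in Hz, shape (n_frames,)
    harmonic_amps : per-frame harmonic amplitudes, shape (n_frames, n_harmonics)
    noise_env     : per-frame gain of the stochastic component, shape (n_frames,)
    """
    n_frames, n_harmonics = harmonic_amps.shape
    n_samples = n_frames * hop
    t_frames = np.arange(n_frames)
    t_samples = np.arange(n_samples) / hop  # sample positions in frame units

    # Upsample frame-rate parameters to sample rate by linear interpolation.
    f0_s = np.interp(t_samples, t_frames, f0)
    noise_s = np.interp(t_samples, t_frames, noise_env)

    # Harmonic part: integrate f0 to get phase, then sum the harmonics.
    phase = 2.0 * np.pi * np.cumsum(f0_s) / sr
    harmonic = np.zeros(n_samples)
    for k in range(1, n_harmonics + 1):
        amp_s = np.interp(t_samples, t_frames, harmonic_amps[:, k - 1])
        harmonic += amp_s * np.sin(k * phase)

    # Noise part: white noise shaped only by a gain envelope here; a full
    # HNM would also filter it per frame to match the spectral envelope.
    return harmonic + noise_s * np.random.randn(n_samples)
```

Expression controls such as vibrato or dynamics then act by modulating f0 and the amplitude envelopes before synthesis.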
Affective Music Information Retrieval
Much of the appeal of music lies in its power to convey emotions/moods and to
evoke them in listeners. In consequence, the past decade witnessed a growing
interest in modeling emotions from musical signals in the music information
retrieval (MIR) community. In this article, we present a novel generative
approach to music emotion modeling, with a specific focus on the
valence-arousal (VA) dimension model of emotion. The presented generative
model, called \emph{acoustic emotion Gaussians} (AEG), better accounts for the
subjectivity of emotion perception by the use of probability distributions.
Specifically, from the emotion annotations of multiple subjects, it learns a
Gaussian mixture model in the VA space with prior constraints on the
corresponding acoustic features of the training music pieces. Such a
computational framework is technically sound, capable of learning in an online
fashion, and thus applicable to a variety of applications, including
user-independent (general) and user-dependent (personalized) emotion
recognition and emotion-based music retrieval. We report evaluations of the
aforementioned applications of AEG on a large-scale emotion-annotated corpus,
AMG1608, to demonstrate the effectiveness of AEG and to showcase how
evaluations are conducted for research on emotion-based MIR. Directions of
future work are also discussed. Comment: 40 pages, 18 figures, 5 tables, author version
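As a rough illustration of representing emotion as a distribution over the VA plane (not the actual AEG model, which further ties the mixture to acoustic features of the training pieces), one might fit a plain Gaussian mixture to pooled annotations:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for the generative view in AEG: pool the (valence, arousal)
# annotations from many subjects and fit a Gaussian mixture, so a clip's
# emotion is a probability distribution rather than a single VA point.
rng = np.random.default_rng(0)
va_annotations = rng.uniform(-1.0, 1.0, size=(500, 2))  # fake annotations

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(va_annotations)

# The component posterior captures annotator subjectivity; AEG additionally
# learns how acoustic features select among such components.
print(gmm.predict_proba(va_annotations[:3]))
```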
An Output-Recurrent-Neural-Network-Based Iterative Learning Control for Unknown Nonlinear Dynamic Plants
We present a design method for an iterative learning control system based on an output recurrent neural network (ORNN). Two ORNNs are employed to build the learning control structure. The first, called the output recurrent neural controller (ORNC), serves as the iterative learning controller that achieves the learning control objective. Guaranteeing convergence of the learning error requires some information about plant sensitivity in order to design a suitable adaptive law for the ORNC; hence, a second ORNN, called the output recurrent neural identifier (ORNI), is used as an identifier to provide that information. All weights of the ORNC and ORNI are tuned during the control iteration and identification process, respectively, to achieve the desired learning performance. The adaptive laws for the weights of the ORNC and ORNI, and the analysis of learning performance, are derived via a Lyapunov-like analysis. It is shown that the identification error converges asymptotically to zero, and the repetitive output tracking error converges asymptotically to zero except for the initial resetting error.
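To make the iterative learning setting concrete, here is a minimal P-type ILC loop on a hypothetical first-order plant; the paper replaces the fixed scalar gain below with ORNN-based adaptive laws derived from the Lyapunov-like analysis, so treat this purely as a sketch of the iteration-domain idea:

```python
import numpy as np

def plant(u, a=0.7, b=0.5):
    """Hypothetical first-order plant: y[t] = a*y[t-1] + b*u[t-1]."""
    y = np.zeros_like(u)
    for t in range(1, len(u)):
        y[t] = a * y[t - 1] + b * u[t - 1]
    return y

def p_type_ilc(y_ref, n_iters=60, gain=0.5):
    """Refine the whole control trajectory across repeated trials."""
    u = np.zeros_like(y_ref)
    for _ in range(n_iters):
        e = y_ref - plant(u)    # tracking error of the current trial
        u[:-1] += gain * e[1:]  # Arimoto update, shifted by the plant delay
    return u, e

y_ref = np.sin(np.linspace(0.0, 2.0 * np.pi, 100))
u, e = p_type_ilc(y_ref)
print("max tracking error after learning:", np.abs(e).max())
```

Because the plant resets to the same initial state at every trial, the error at the initial sample cannot be learned away, which is exactly the "initial resetting error" the abstract excludes.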
SingNet: A Real-time Singing Voice Beat and Downbeat Tracking System
Singing voice beat and downbeat tracking has several applications in
automatic music production, analysis, and manipulation. Among them, some require
real-time processing, such as live performance processing and
auto-accompaniment for singing inputs. This task is challenging owing to the
non-trivial rhythmic and harmonic patterns in singing signals. Real-time
processing introduces further constraints, such as the inaccessibility of
future data and the impossibility of correcting earlier results that are
inconsistent with later ones. In this paper, we introduce the first system
that tracks the beats and downbeats of singing voices in real-time.
Specifically, we propose a novel dynamic particle filtering approach that
incorporates offline historical data to correct the online inference by using a
variable number of particles. We evaluate the performance on two datasets:
GTZAN with the separated vocal tracks, and an in-house dataset with the
original vocal stems. Experimental results demonstrate that our proposed
approach outperforms the baseline by 3-5%. Comment: Accepted for the 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023)
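For intuition, a single online update of a toy beat-phase particle filter might look like the following, where each particle carries a tempo period and a beat phase. The variable particle count and the correction from offline historical data that distinguish the proposed dynamic particle filtering are not modeled here, and all names and constants are assumptions:

```python
import numpy as np

def beat_pf_step(particles, weights, onset_strength, hop_time, rng):
    """One online step: particles are (period_sec, phase_sec) hypotheses."""
    # Motion model: advance each phase by the hop and jitter the tempo.
    particles[:, 1] = (particles[:, 1] + hop_time) % particles[:, 0]
    particles[:, 0] = np.clip(
        particles[:, 0] + rng.normal(0.0, 0.001, len(particles)), 0.3, 1.0)

    # Observation model: reward particles whose phase sits near a beat,
    # in proportion to the current onset strength.
    dist = np.minimum(particles[:, 1], particles[:, 0] - particles[:, 1])
    weights = weights * (1e-3 + onset_strength * np.exp(-dist**2 / (2 * 0.02**2)))
    weights /= weights.sum()

    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights**2) < len(particles) / 2:
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

rng = np.random.default_rng(0)
particles = np.column_stack([rng.uniform(0.3, 1.0, 200), rng.uniform(0.0, 0.3, 200)])
weights = np.full(200, 1.0 / 200)
# onset_strength would come from any online onset detector, hop by hop.
particles, weights = beat_pf_step(particles, weights, 0.8, 512 / 22050, rng)
print("tempo estimate (BPM):", 60.0 / np.average(particles[:, 0], weights=weights))
```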
Music Source Separation with Band-Split RoPE Transformer
Music source separation (MSS) aims to separate a music recording into
multiple musically distinct stems, such as vocals, bass, drums, and more.
Recently, deep learning approaches such as convolutional neural networks (CNNs)
and recurrent neural networks (RNNs) have been used, but the improvement is
still limited. In this paper, we propose a novel frequency-domain approach
based on a Band-Split RoPE Transformer (called BS-RoFormer). BS-RoFormer
relies on a band-split module to project the input complex spectrogram into
subband-level representations, and then arranges a stack of hierarchical
Transformers to model the inner-band as well as inter-band sequences for
multi-band mask estimation. To facilitate training the model for MSS, we
propose to use the Rotary Position Embedding (RoPE). The BS-RoFormer system
trained on MUSDB18HQ and 500 extra songs ranked first in the MSS track of the
Sound Demixing Challenge (SDX23). Benchmarking a smaller version of
BS-RoFormer on MUSDB18HQ, we achieve a state-of-the-art result without extra
training data, with 9.80 dB of average SDR. Comment: This paper explains the SAMI-ByteDance MSS system submitted to the Sound Demixing Challenge (SDX23) Music Separation Track
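Since RoPE is the paper's key positional-encoding choice, a self-contained sketch of the half-split variant follows; how BS-RoFormer wires it into the band-split Transformer stack is not shown here.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary Position Embedding (half-split variant) for (seq_len, dim) input.

    Channel pairs are rotated by position-dependent angles, so the dot
    product of two rotated vectors depends only on their relative offset,
    which is what makes RoPE attractive for attention over long sequences.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE pairs up channels, so dim must be even"
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)     # geometric frequency ladder
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Apply the same 2-D rotation to every (x1[i], x2[i]) channel pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(16, 64)
print(rope(q).shape)  # (16, 64); applied to queries and keys before attention
```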